refactor: rai_bench#517

Merged
maciejmajek merged 45 commits into development from jm/refactor/rai_bench
Apr 23, 2025

Conversation


@jmatejcz (Contributor) commented Apr 9, 2025

Purpose

Unify the folder structure and naming of benchmarks.
Refactor the tool_calling benchmark, as there are conflicts between these branches:
https://github.com/RobotecAI/rai/tree/mk/feat/spatial-reasoning-tasks
https://github.com/RobotecAI/rai/tree/jm/feat/tool-benchmark-custom-interfaces
Both changed the tool calling benchmark and were done in a hurry, which resulted in a lot of conflicts and not the best code.

Merge and unify the tasks that are already on development with the spatial, navigation, and custom interfaces tasks.
Improve and unify logging and saving of results.

This PR is big, so please follow the change descriptions below. Focus first on points 1 and 2: the new frame is essential to this PR, as it dictates how validation is executed, and many of the other changes are adjusted to it.

Related PRs and branches will also be closed when this PR is merged.

⚠️ Warning: This PR does not provide a full benchmark that is ready to test models; only sample tasks were created. The action mocks/tests are also missing, as I decided to handle them in separate PRs.

Proposed Changes

  1. Restructured package:
  • All naming unified; there are now two benchmarks: manipulation_o3de and tool_calling_agent. Every folder related to a benchmark has exactly that name.

  • All code providing the framework for creating benchmarks lives in the corresponding benchmark folder.

  • All code related to a specific benchmark implementation lives in the examples/ folder.

  • Experiment logs and results live in the experiments/ folder.

  • Files/folders responsible for interfaces, tasks, benchmarks, etc. are named the same across benchmarks.

           ├── rai_bench
           │   ├── examples/
           │   ├── experiments/
           │   ├── tool_calling_agent/
           │   ├── manipulation_o3de/

THE REST OF THE POINTS CONCERN THE TOOL CALLING AGENT BENCHMARK:
2. New frame for the tool_calling_agent benchmark:

  • SubTask - the smallest block, responsible for validating a single tool call (e.g. ListTopics)
  • Validator - consists of subtasks; based on the validator type, it checks whether all subtasks were completed in a certain way
  • Task - consists of validators. Every Validator can be treated as a single step that is scored atomically; visit examples/tool_calling_agent/tasks.py for more intuition. Every Task always has the same prompt and available tools; only the validation methods can be parametrized. On top of validators, you can pass the extra_tool_calls param to allow the model to correct itself.
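A minimal sketch of this SubTask/Validator/Task frame, to make the relation concrete. All class names and method signatures here are illustrative assumptions, not the actual rai_bench API; the real validators and scoring live in the package itself.

```python
# Hypothetical sketch of the SubTask / Validator / Task frame; names and
# signatures are illustrative only, not the real rai_bench interfaces.
from typing import Any, Callable, Dict, List, Tuple

class SubTask:
    """Smallest block: validates a single tool call."""
    def __init__(self, expected_tool: str,
                 check: Callable[[Dict[str, Any]], bool] = lambda args: True):
        self.expected_tool = expected_tool
        self.check = check

    def validate(self, tool_call: Dict[str, Any]) -> bool:
        return tool_call["name"] == self.expected_tool and self.check(tool_call["args"])

class OrderedValidator:
    """Passes when its subtasks are matched, in order, by the tool calls."""
    def __init__(self, subtasks: List[SubTask]):
        self.subtasks = subtasks

    def validate(self, tool_calls: List[Dict[str, Any]]) -> Tuple[bool, List[Dict[str, Any]]]:
        idx = 0
        for i, call in enumerate(tool_calls):
            if self.subtasks[idx].validate(call):
                idx += 1
                if idx == len(self.subtasks):
                    # Success: hand the remaining tool calls to the next validator.
                    return True, tool_calls[i + 1:]
        return False, []

class Task:
    """Consists of validators; each validator is one step, scored atomically."""
    def __init__(self, validators: List[OrderedValidator]):
        self.validators = validators

    def score(self, tool_calls: List[Dict[str, Any]]) -> float:
        passed = 0
        remaining = tool_calls
        for validator in self.validators:
            ok, remaining = validator.validate(remaining)
            passed += ok
        return passed / len(self.validators)
```

Under this sketch, strictness is tuned by which subtasks each validator holds and by how many extra tool calls the task tolerates before validation, mirroring the extra_tool_calls param described above.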

[diagram: refactor_bench framework]

  3. Migrated, refactored, and unified the tool_calling_agent benchmark tasks.

  4. Mocks:
  • mocks of tools (mostly imported from the branches mentioned above)
  • refactored action mocks, but more work is needed here - it will be continued in this issue -> Action mocks #526

  5. Models:
  • Pydantic models that reflect messages from ROS 2, which enables validation

  6. Unit tests for subtasks and validators.

  7. New GetInterfaceTool - the old version didn't return the types of fields.

  8. Results and logs:
  • (unchanged) all logs are written to benchmark.log files
  • the result file is intended to be the source of info for further processing.
    Structure of results:
    [image: structure of results]
    Results now have a list of validators showing what each validator expects,
    followed by a passed list, which holds a bool for every validator,
    followed by score (which is redundant with passed, but it would be odd not to have a score in the results; if you have ideas here, please share),
    followed by errors, a list of lists where every validator has its own list of errors.

  9. Args passed when running the tool calling agent benchmark: the user can pass 2 args - model_name and vendor.

  10. Small docs added, as I wanted to paste an image. The docs are small for now, but I guess there will be docs for benchmarks in the future anyway.

  11. Script to test multiple models, different benchmarks, or several repeats in one go: https://github.com/RobotecAI/rai/blob/jm/refactor/rai_bench/src/rai_bench/rai_bench/examples/test_models.py
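The result-record shape described in the results point above can be illustrated as follows. The field values and the task prompt are made up for illustration, and the exact keys in rai_bench may differ; only the validators/passed/score/errors relationship comes from the description.

```python
# Illustrative shape of one result record, based on the fields described above
# (validators, passed, score, errors); exact key names in rai_bench may differ.
result = {
    "task_prompt": "List the available ROS 2 topics.",  # hypothetical task
    "validators": ["ordered: ListTopics"],              # what each validator expects
    "passed": [True],                                   # one bool per validator
    "score": 1.0,                                       # fraction of validators passed
    "errors": [[]],                                     # one error list per validator
}
```

The invariant worth keeping in further processing: validators, passed, and errors are parallel lists, and score is derivable from passed.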

Issues

#515
Related PRs and branches:
#493
#487
mk/feat/tool-calling-bench-navigation-tasks

Testing

Test single

python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama

tests:

pytest tests/rai_bench/

script running benchmarks:

python src/rai_bench/rai_bench/examples/test_models.py 

Next steps


jmatejcz commented Apr 9, 2025

@MagdalenaKotynia @maciejmajek please take a look and see if you like this reconstruction of the package structure and the new frame. If yes, I will proceed with applying this refactor to the other tasks and to the changes from the 2 conflicted branches.

The refactor is not completed yet; I have only applied the new frame to 2 tasks as an example, so don't pay attention to the untouched parts of the code.

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 2 times, most recently from 2e189d9 to 38e64ef Compare April 9, 2025 14:54
@jmatejcz (Contributor Author)

How should we log errors when extra calls are passed? For example, if there are errors in 3 calls but the agent did it correctly in the 4th, should we log the previous 3 even if the validator eventually passed?

trace_id=str(run_id),
name="tool calls result",
value=float(success),
value=float(score),

Score is already float, so you don't need to convert it to float.


fixed here: ef8ae2f

run_id=run_id,
key="tool calls result",
score=float(success),
score=float(score),

fixed here: ef8ae2f


done_properly = 0
for validator in self.validators:
if_success, remaining_tool_calls = validator.validate(

What about is_success?


you mean naming?


yes

for arg_name, arg_value in expected_args.items():
if arg_name in tool_call["args"]:
if tool_call["args"][arg_name] != arg_value:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(
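For context on why this suggestion (and the identical ones below) matters: in Python, constructing an exception without `raise` is a silent no-op, so the original code would create the error object, discard it, and continue validating as if nothing went wrong. A minimal self-contained demonstration, with hypothetical check functions:

```python
# Demonstrates the bug the "raise SubTaskValidationError(...)" suggestions fix:
# constructing an exception without raising it is silently discarded.
class SubTaskValidationError(Exception):
    pass

def check_without_raise(value: int) -> str:
    if value < 0:
        SubTaskValidationError("negative value")  # no-op: object created, then discarded
    return "passed"

def check_with_raise(value: int) -> str:
    if value < 0:
        raise SubTaskValidationError("negative value")  # actually interrupts validation
    return "passed"
```

With the first variant, invalid input still reports "passed"; with the second, the caller gets the error and can record it per validator.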


applied here: ef8ae2f

f"Expected argument '{arg_name}' should have value '{arg_value}', but got '{tool_call['args'][arg_name]}'"
)
else:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

if arg_name not in expected_args:
# If this argument is not required, check if it's an allowed optional argument
if not expected_optional_args or arg_name not in expected_optional_args:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

# If optional argument has expected value, check if the value is correct
elif expected_optional_args[arg_name]:
if expected_optional_args[arg_name] != arg_value:
SubTaskValidationError(

Suggested change
SubTaskValidationError(
raise SubTaskValidationError(


applied here: ef8ae2f

except StopIteration:
return True, tool_calls[i:]

self.log_error(msg="Failed to validate")

I suggest logging a less generic message, with more info about which tool failed to validate.


@MagdalenaKotynia MagdalenaKotynia left a comment


The proposed new structure overall looks good to me.
Please remember about handling strict/non-strict validation. Currently, both implemented validators seem to be strict and do not allow the agent to self-correct. Please also remember to handle tool call args that may have an allowed value in some range, and situations where the value does not matter.
FYI, I didn't do a code review; I just reviewed the structure, but I left some comments regarding the code when I noticed something along the way.


jmatejcz commented Apr 11, 2025

> The proposed new structure overall looks good to me. Please remember about handling strict/not strict validation. Currently, both implemented validators seem to be strict and not allow agent to self-correct. Please also remember to handle tool call args that may have allowed value in some range and situations when value does not matter. FYI I didn't do code review, I just reviewed the structure, but I left some comments regarding the code when I noticed something by the way.

I think you misunderstood the validator-task relation. You can make validation strict or less strict by adjusting the subtasks passed to validators and the extra_tool_calls param; please look at rai_bench/examples/tool_calling_agent/tasks.py.

Also look at the picture of the framework:
#515

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 5 times, most recently from b1b6889 to 89a3a1e Compare April 16, 2025 10:09
@jmatejcz jmatejcz marked this pull request as ready for review April 16, 2025 10:23

@maciejmajek (Member) commented Apr 16, 2025


Answering your question, please rename the file to interface_parser.py and move it to the generic folder.


applied here: 678047a

@jmatejcz
Contributor Author

moved docs to rai_bench: bdc58e6

@jmatejcz
Contributor Author

saving errors from extra tool calls: 010fc5d

@jmatejcz
Contributor Author

removed default empty dicts from subtasks and improved error messages in validators:
710864f

@jmatejcz
Contributor Author

@maciejmajek we spoke about setting the recursion limit to the number of tool calls required in the task, but 1 step is just one node execution, not a tool call, so restricting it like this does not make much sense (tool calls are in AIMessages).
[image: agent execution steps]

I've set the recursion limit to 4*(required_tool_calls+extra_tool_calls) and changed agent.invoke to stream, which lets us collect messages from the agent even when the recursion limit occurs: 0cddef4
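The invoke-vs-stream trade-off described here can be shown with a dependency-free sketch. This is not the actual LangGraph/rai_bench code: the GraphRecursionError class and fake_agent_stream generator are stand-ins that simulate an agent hitting its recursion limit partway through.

```python
# Sketch of why streaming helps: a plain invoke() that hits the recursion
# limit loses all output, while consuming a stream keeps the messages
# collected so far. fake_agent_stream stands in for the real agent.
from typing import Dict, Iterator, List

class GraphRecursionError(Exception):
    """Stand-in for the error raised when the recursion limit is exceeded."""

def fake_agent_stream(limit: int) -> Iterator[Dict[str, str]]:
    """Yields messages until the (simulated) recursion limit is hit."""
    for step in range(10):
        if step >= limit:
            raise GraphRecursionError("recursion limit reached")
        yield {"step": str(step), "content": f"message {step}"}

def collect_messages(limit: int) -> List[Dict[str, str]]:
    messages: List[Dict[str, str]] = []
    try:
        for msg in fake_agent_stream(limit):
            messages.append(msg)
    except GraphRecursionError:
        pass  # keep whatever was collected before the limit was hit
    return messages
```

With invoke-style consumption, the exception would propagate before any result is returned; the streaming loop above retains the partial transcript, which is what the validators need to score the run.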

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch from ce0979b to 103d22f Compare April 17, 2025 10:50

jmatejcz commented Apr 17, 2025

Added a python script that lets the user run multiple models on different benchmarks in one go:
ce0979b
Also modified model initialization to pass only the model name and vendor.

Modified the instructions in the README: e6953e3
and in the PR description on how to use it.

@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch 2 times, most recently from e6953e3 to 82723c7 Compare April 17, 2025 11:43
@jmatejcz jmatejcz requested a review from maciejmajek April 17, 2025 11:48
jmatejcz and others added 26 commits April 22, 2025 09:40
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
fixes to subtasks
import fix to old tests
deleted unused code
adjusted error messages in validators
gathering tool calls even when recursion limit occurs
refactor running benchmarks to take args
added model initialization via model name
@jmatejcz jmatejcz force-pushed the jm/refactor/rai_bench branch from c80ed2e to 3fb30ca Compare April 22, 2025 07:40
Comment on lines +163 to +197
```python
def get_llm_model_direct(
    model_name: str,
    vendor: str,
    config_path: Optional[str] = None,
    **kwargs: Any,
) -> ChatOpenAI | ChatBedrock | ChatOllama:
    config = load_config(config_path)
    model_config = getattr(config, vendor)

    logger.info(f"Initializing Model: {model_name}, Vendor: {vendor}")
    if vendor == "openai":
        from langchain_openai import ChatOpenAI

        model_config = cast(OpenAIConfig, model_config)

        return ChatOpenAI(model=model_name, base_url=model_config.base_url, **kwargs)
    elif vendor == "aws":
        from langchain_aws import ChatBedrock

        model_config = cast(AWSConfig, model_config)

        return ChatBedrock(
            model_id=model_name,
            region_name=model_config.region_name,
            **kwargs,
        )
    elif vendor == "ollama":
        from langchain_ollama import ChatOllama

        model_config = cast(OllamaConfig, model_config)
        return ChatOllama(model=model_name, base_url=model_config.base_url, **kwargs)
    else:
        raise ValueError(f"Unknown LLM vendor: {vendor}")
```



This could be a factory; apart from that, LGTM.
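One way the factory suggestion could look: map vendor names to builder functions instead of an if/elif chain, so adding a vendor is a one-line registry change. This is a hedged sketch, not the merged implementation; the `_build_*` functions return placeholder strings standing in for the real ChatOpenAI/ChatBedrock/ChatOllama constructors and their config handling.

```python
# Sketch of a vendor factory for get_llm_model_direct. The builders below are
# placeholders; in real code they would construct the langchain chat models
# using the vendor-specific config, as in the reviewed function.
from typing import Any, Callable, Dict

def _build_openai(model_name: str, **kwargs: Any) -> str:
    return f"ChatOpenAI({model_name})"  # placeholder for the real constructor call

def _build_aws(model_name: str, **kwargs: Any) -> str:
    return f"ChatBedrock({model_name})"

def _build_ollama(model_name: str, **kwargs: Any) -> str:
    return f"ChatOllama({model_name})"

_VENDOR_FACTORIES: Dict[str, Callable[..., str]] = {
    "openai": _build_openai,
    "aws": _build_aws,
    "ollama": _build_ollama,
}

def get_llm_model_direct(model_name: str, vendor: str, **kwargs: Any) -> str:
    try:
        factory = _VENDOR_FACTORIES[vendor]
    except KeyError:
        # same error contract as the reviewed code
        raise ValueError(f"Unknown LLM vendor: {vendor}") from None
    return factory(model_name, **kwargs)
```

The registry also keeps lazy imports possible: each builder can import its langchain package only when called, matching the in-branch imports of the original.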

@maciejmajek maciejmajek merged commit 0eec253 into development Apr 23, 2025
5 checks passed
@maciejmajek maciejmajek deleted the jm/refactor/rai_bench branch April 23, 2025 19:09
